Gesture tracking for mobile rendered augmented reality

ABSTRACT

An augmented reality (AR) system hosted and executed on a mobile client enables control of AR objects using gestures. The system receives, from the mobile client, an image from a camera view (e.g., the mobile client's camera) of an environment, where the image depicts a user's hand. The system applies a machine learning model to the received image. The machine learning model identifies a formation of the hand. The system determines to render an AR object based on the identified formation. For example, a user forming a fist with his hand may cause an AR ball to move upward within the screen of the mobile client.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/971,766, filed Feb. 7, 2020, which is incorporated by reference in its entirety.

TECHNICAL FIELD

The disclosure generally relates to the field of mobile rendered augmented reality and more specifically to object control based on gesture tracking in mobile rendered augmented reality environments.

BACKGROUND

Conventional augmented reality (AR) systems use handheld controllers to track a user's hands and determine whether the user is making a hand gesture. However, the tracked gestures are limited to what gestures can be made with a controller in the user's hands. For example, the user cannot make a “five” gesture (i.e., palm open and fingers/thumb extended) without dropping the controller. Mobile devices may execute AR applications without handheld controllers, but execution of the applications slows. This is because a mobile device has limited processing bandwidth and/or limited battery life for the intensive processing required. Hence, there is a lack of AR systems that can track user gestures without demanding large processing bandwidth or excessively consuming a device's power.

BRIEF DESCRIPTION OF DRAWINGS

The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description, the appended claims, and the accompanying figures (or drawings). A brief introduction of the figures is below.

Figure (FIG.) 1 illustrates an augmented reality (AR) system environment, in accordance with at least one embodiment.

FIG. 2 is a block diagram of the gesture tracking application of FIG. 1, in accordance with at least one embodiment.

FIG. 3 is a flowchart illustrating a process for controlling an AR object using gesture detection, in accordance with at least one embodiment.

FIG. 4 is a flowchart illustrating a process for controlling the AR object using gesture detection based on the process of FIG. 3, in accordance with at least one embodiment.

FIGS. 5A and 5B illustrate user interactions with an AR application that integrates gesture tracking to control AR objects, in accordance with at least one embodiment.

FIG. 6 illustrates a block diagram including components of a machine able to read instructions from a machine-readable medium and execute them in a processor (or controller), in accordance with at least one embodiment.

DETAILED DESCRIPTION

The Figures (FIGS.) and the following description relate to preferred embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.

Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.

Configuration Overview

In one example embodiment of a disclosed system, method and computer readable storage medium, an augmented reality (AR) object is rendered on a mobile client based on a user's gestures identified within a camera view of the mobile client. Conventional gesture tracking solutions for AR are designed for handheld hardware to track where a user's hands are and determine if a gesture is being made. Accordingly, described is a configuration that enables gesture tracking to control an AR object in a mobile rendered AR system while optimizing for the power and processing constraints of the mobile client.

In one example configuration, a camera coupled with the mobile client (e.g., integrated with the mobile client or connected to the mobile client wirelessly or by wire) captures a camera view of an environment. The environment may correspond to the physical world, which may include a portion of a user's body, that may be positioned within a field of view of the camera. A processor (e.g., of the mobile device) processes program code that causes the processor to execute specified functions as are further described herein. Accordingly, the processor receives an image from the camera view of the environment and applies a machine learning model to the received image. Using the machine learning model, the processor identifies a gesture, which may also be referred to herein as a “formation,” made by the hand depicted within the image. The processor determines, based on the identified gesture, a state in which to render an AR object (e.g., from an AR engine). The processor provides for display (e.g., transmits instructions (e.g., program code or software) to render on a screen) the rendered object in the environment (e.g., with the user's hand) to the mobile client. In some embodiments, a user interacts with the displayed AR objects.

In some embodiments, if the processor does not detect a gesture or a hand within a received image, the processor may determine not to store the image into memory or remove the image from memory such that the processor will not expend further processing resources on the frame (e.g., due to its lack of depicting an informative object). In some embodiments, the machine learning model for detecting gestures may require more processing resources than a machine learning model for detecting hands. In these circumstances, the processor may optimize the mobile client's processing resources by applying a hand detection model to an image first rather than directly applying a gesture tracking model.

Gesture tracking allows the user to use a mobile device to control and interact with AR objects without dedicated hardware as though they are interacting with reality around the user, presenting an immersive gaming experience for the user. In particular, the methods described herein allow for AR object control using gesture tracking on a mobile client that does not consume too much processing and/or battery power.

Augmented Reality System Environment

Figure (FIG.) 1 illustrates an augmented reality (AR) system environment, in accordance with at least one embodiment. The AR system environment enables AR applications on a mobile client 100, and in some embodiments, presents immersive experiences to users via gesture tracking. The system environment includes a mobile client 100, an AR system 110, an AR engine 120, a gesture tracking application 130, a database 140, and a network 150. The AR system 110, in some example embodiments, may include the mobile client 100, the AR engine 120, the gesture tracking application 130, and the database 140. In other example embodiments, the AR system 110 may include the AR engine 120, the gesture tracking application 130, and the database 140, but not the mobile client 100, such that the AR system 110 communicatively couples (e.g., wireless communication) to the mobile client 100 from a remote server.

The mobile client 100 is a mobile device that is or incorporates a computer. The mobile client may be, for example, a relatively small computing device in which network, processing (e.g., processor and/or controller) and power resources (e.g., battery) may be limited and that has a form factor such as that of a smartphone, tablet, wearable device (e.g., smartwatch) and/or a portable internet enabled device. The limitations of such a device extend from scientific principles that must be adhered to in designing such products for portability and use away from constant power draw sources.

The mobile client 100 may be a computing device that includes the components of the machine depicted in FIG. 6. The mobile client 100 has general and/or special purpose processors, memory, storage, networking components (either wired or wireless). The mobile client 100 can communicate over one or more communication connections (e.g., a wired connection such as ethernet or a wireless communication via cellular signal (e.g., LTE, 5G), WiFi, satellite) and includes a global positioning system (GPS) used to determine a location of the mobile client 100.

The mobile client 100 also includes one or more cameras 102 that can capture forward and rear facing images and/or videos. To capture images for gesture or hand detection, the camera 102 may be a two-dimensional (2D) camera as opposed to a stereo camera or a three-dimensional (3D) camera. That is, the machine-learned detection described herein does not necessarily require a 3D image or depth to classify a hand or a gesture depicted within an image.

The mobile client 100 also includes a screen (or display) 103 and a display driver to provide for display interfaces on the screen 103 associated with the mobile client 100. The mobile client 100 executes an operating system, such as GOOGLE ANDROID OS and/or APPLE iOS, and includes the screen 103 and/or a user interface that the user can interact with. In some embodiments, the mobile client 100 couples to the AR system 110, which enables it to execute an AR application (e.g., the AR client 101).

The AR engine 120 interacts with the mobile client 100 to execute the AR client 101 (e.g., an AR game). For example, the AR engine 120 may be a game engine such as UNITY and/or UNREAL ENGINE. The AR engine 120 displays, and the user interacts with, the AR game via the mobile client 100. For example, the mobile client 100 may host and execute the AR client 101 that in turn accesses the AR engine 120 to enable the user to interact with the AR game. Although the AR application refers to an AR gaming application in many instances described herein, the AR application may be a retail application integrating AR for modeling purchasable products, an educational application integrating AR for demonstrating concepts within a learning curriculum, or any suitable interactive application in which AR may be used to augment the interactions. In some embodiments, the AR engine 120 is integrated into and/or hosted on the mobile client 100. In other embodiments, the AR engine 120 is hosted external to the mobile client 100 and communicatively couples to the mobile client 100 over the network 150. The AR system 110 may comprise program code that executes functions as described herein.

In some example embodiments, the AR system 110 includes the gesture tracking application 130. The gesture tracking application enables gesture tracking in the AR game such that AR objects (e.g., virtual objects rendered by the AR engine 120) and their behaviors or states may be controlled by the user. The user may capture an image and/or video of an environment captured within a camera view of the camera 102 of the mobile client 100. An image from the camera view may depict a portion of the user's body such as the user's hand. The AR engine 120 renders an AR object, where the rendering may be based on gestures that the gesture tracking application 130 has determined that the user is performing (e.g., a fist). The gesture tracking application 130 may detect or track (e.g., detecting changes in gestures over time) a variety of gestures such as a five (i.e., an open palm facing away from the user), a fist, pointing, waving, facial gestures (e.g., smiles, open mouth, etc.), lifting a leg, bending, kicking, or any suitable movement made by any portion of the body to express an intention. While the gesture tracking application 130 is described herein as primarily tracking hand gestures, the gesture tracking application 130 may detect various gestures as described above. As referred to herein, “gesture” and “formation” may be used interchangeably.

During use of the AR client 101 (e.g., during game play), the gesture tracking application 130 identifies a body part (e.g., a hand) within an image from a camera view captured by the camera 102, determines a gesture made by the body part (e.g., a five), and renders an AR object based on the determined gesture. The gesture tracking application 130 may instruct the camera 102 to capture image frames periodically (e.g., every three seconds). In some embodiments, the state in which the AR engine 120 displays the AR object depends on input from the user during game play. For example, the direction of the AR object's movement may change depending on the user's gestures. FIGS. 5A and 5B, described further herein, provide details on how gesture tracking may be used to control AR objects in the AR system 110. In some embodiments, the AR system 110 includes applications instead of and/or in addition to the gesture tracking application 130. In some embodiments, the gesture tracking application 130 may be hosted on and/or executed by the mobile client 100. In other embodiments, the gesture tracking application 130 is communicatively coupled to the mobile client 100.

The database 140 stores images or videos that may be used by the gesture tracking application 130 to detect a user's hand and determine the gesture the hand is making. The mobile client 100 may transmit images or videos collected by the camera 102 during the execution of the AR client 101 to the database 140. The data stored within the database 140 may be collected from a single user (e.g., the user of the mobile client 100) or multiple users (e.g., users of other mobile clients that are communicatively coupled to the AR system 110 through the network 150). The gesture tracking application 130 may use images and/or videos of gestures stored in the database 140 to train a model (e.g., a neural network). In particular, the machine learning model training engine 210 of the gesture tracking application 130 may access the database 140 to train a machine learning model. This is described in further detail in the description of FIG. 2.

The database 140 may store a mapping of gestures to AR objects and/or states in which the AR objects may be rendered. The gesture tracking application 130 may determine a gesture made by a user's hand as captured within a camera view of the mobile client 100 and access the database 140 to determine, using the determined gesture, that an AR object should be rendered in a particular state (e.g., an AR ball floating upward). The database 140 may store one or more user profiles, each user profile including user customizations or settings that personalize the user's experience using the AR client 101. For example, a user profile stored within the database 140 may store a user-specified name of a custom hand gesture, images of the customized hand gesture (e.g., taken by the user using the mobile client 100), and a user-specified mapping of a gesture to the customized hand gesture.
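By way of illustration only, the gesture-to-state lookup described above might be sketched as follows. The names `DEFAULT_MAPPING` and `lookup_state`, the state labels, and the "five" entry are hypothetical and chosen for this sketch; only the fist-to-upward-ball example comes from the description.

```python
from typing import Dict, Optional, Tuple

# Hypothetical default mapping: gesture label -> (AR object, state).
# Only the "fist" entry mirrors an example in the description.
DEFAULT_MAPPING: Dict[str, Tuple[str, str]] = {
    "fist": ("ball", "float_up"),
    "five": ("ball", "float_down"),  # assumed for illustration
}


def lookup_state(
    gesture: str,
    user_mapping: Optional[Dict[str, Tuple[str, str]]] = None,
) -> Optional[Tuple[str, str]]:
    """Return (AR object, state) for a gesture, or None if unmapped.

    Entries from a user profile, if supplied, override the defaults.
    """
    merged = {**DEFAULT_MAPPING, **(user_mapping or {})}
    return merged.get(gesture)
```

In this sketch a user profile is represented as a plain dictionary, so a custom gesture such as a wave can be mapped without altering the defaults.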

The network 150 transmits data between the mobile client 100 and the AR system 110. The network 150 may be a local area and/or wide area network that uses wired and/or wireless communication systems, such as the internet. In some embodiments, the network 150 includes encryption capabilities to ensure the security of data, such as secure sockets layer (SSL), transport layer security (TLS), virtual private networks (VPNs), internet protocol security (IPsec), etc.

Example Gesture Tracking Application

FIG. 2 is a block diagram of the gesture tracking application 130 of FIG. 1, in accordance with at least one example embodiment. The gesture tracking application 130 includes a machine learning model training engine 210, a gesture tracking module 220, an AR object state module 230, and a rendering module 240. The gesture tracking module 220 further includes a gesture detection model 221 and a hand detection model 222. In some embodiments, the gesture tracking application 130 includes modules other than those shown in FIG. 2. The modules may be embodied as program code (e.g., software comprised of instructions stored on non-transitory computer readable storage medium and executable by at least one processor such as the processor 602 in FIG. 6) and/or hardware (e.g., application specific integrated circuit (ASIC) chips or field programmable gate arrays (FPGA) with firmware). The modules correspond to at least having the functionality described when executed/operated.

The process of detecting a gesture and modifying the state of an AR object based on the detected gesture may begin with the gesture tracking module 220 receiving an image from the mobile client 100. In one example, the image may be taken by camera 102 and transmitted to the gesture tracking application 130 by the AR client 101 to determine whether a gesture is depicted within the image to control an AR object. The gesture tracking module 220 may apply one or more of trained models such as the gesture detection model 221 and the hand detection model 222 to determine the gesture depicted in the received image. The models may be trained by the machine learning model training engine 210. After classifying the gesture (e.g., determining the user is making a five), the gesture tracking module 220 may provide the classification to the AR object state module 230, which subsequently determines a state in which an AR object should be rendered. The AR object state module 230 provides the determined state to the rendering module 240, which may request an AR object in a particular state from the AR engine 120 and provide the AR object received from the AR engine 120 at the screen 103 of the mobile client 100.

The machine learning model training engine 210 applies training data sets to the gesture detection model 221 or the hand detection model 222. The training engine 210 may create training data sets based on data from the database 140. The training data sets may include positive or negative samples of hand gestures or hands. The training data sets may be labeled according to the presence, or lack thereof, of a hand gesture or hand. The labels may be provided to the training engine 210 from a user (e.g., using a user input interface of the mobile client 100). This may enable the training engine 210 to train, for example, the gesture detection model 221 to classify an image of a custom user gesture according to a user-specified label.
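As one possible sketch of the labeling described above, stored records may be converted into positive and negative samples for a single target label. The `Sample` record and the `build_training_set` helper are hypothetical names introduced for this sketch, and the string identifiers stand in for image data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical stored record: an image identifier and its label."""
    image_id: str
    label: str  # e.g. "five", "fist", a user-specified name, or "none"


def build_training_set(
    records: List[Sample], target_label: str
) -> List[Tuple[str, int]]:
    """Mark each record as a positive (1) or negative (0) sample
    for the given target label."""
    return [
        (record.image_id, 1 if record.label == target_label else 0)
        for record in records
    ]
```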

For the gesture detection model 221 to classify customized gestures, the machine learning model training engine 210 may create a training data set using images of a custom gesture. The gesture tracking module 220 may prompt (e.g., through the screen 103) the user to make a custom gesture and instruct the user to capture multiple images (e.g., in different positions or angles) of the gesture using the camera 102. The gesture tracking module 220 receives the captured images and transmits them to the database 140 for storage. The machine learning model training engine 210 may use the captured images to train the gesture detection model 221 or the hand detection model 222.

The machine learning model training engine 210 may train a machine learning model in multiple stages. In a first stage, the training engine 210 may use generalized data representing hands or hand gestures taken from multiple users. In a second stage, the training engine 210 may use user-specific data representing the hand or hand gestures of the user of mobile client 100 to further optimize the gesture tracking performed by the gesture tracking module 220 to a user.

The gesture detection model 221 and the hand detection model 222 may be machine learning models. The models 221 and 222 may use various machine learning techniques such as linear support vector machine (linear SVM), boosting for other algorithms (e.g., AdaBoost), neural networks, logistic regression, naïve Bayes, memory-based learning, random forests, bagged trees, decision trees, boosted trees, boosted stumps, a supervised or unsupervised learning algorithm, or any suitable combination thereof.

The gesture detection model 221 classifies gestures within images or videos. The gesture detection model 221 is trained using various images of one or more hand gestures (e.g., by machine learning model training engine 210). In some embodiments, data representing an image or video captured by the camera 102 is input into the gesture detection model 221. The gesture detection model 221 classifies one or more objects within the image or video. For example, the gesture detection model 221 classifies a hand making a five within an image whose data is input into the model. In some embodiments, the gesture detection model 221 detects a gesture performed over time (e.g., a hand wave). For example, consecutively captured images are input into the gesture detection model 221 to determine that the images represent a five at varying positions within the camera view, which the gesture tracking module 220 may then classify as a hand wave.
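The hand wave example above may be sketched, under stated assumptions, as a check over per-frame classifications. The function name `is_wave`, the normalized horizontal coordinate, and the `min_frames` and `min_travel` parameters are assumptions made for this sketch and are not specified in the description.

```python
from typing import List, Tuple


def is_wave(
    frames: List[Tuple[str, float]],
    min_frames: int = 3,
    min_travel: float = 0.2,
) -> bool:
    """Classify consecutive frames as a hand wave.

    Each frame is (gesture label, horizontal hand position), with the
    position normalized to [0, 1] across the camera view. A wave is
    reported only when every frame shows a "five" and the hand has
    travelled at least `min_travel` horizontally.
    """
    if len(frames) < min_frames:
        return False
    if any(label != "five" for label, _ in frames):
        return False
    positions = [x for _, x in frames]
    return max(positions) - min(positions) >= min_travel
```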

The hand detection model 222 classifies the presence of hands within images or videos. The hand detection model 222 is trained using various images of hands (e.g., by machine learning model training engine 210). Similar to the gesture detection model 221, data representing an image or video captured by the camera 102 may be input into the hand detection model 222. The hand detection model 222 classifies an object within the image or video as a hand or outputs an indication that no hand was detected. In some embodiments, the hand detection model 222 uses clusters of feature points identified from the camera view, where the feature points are associated with a physical object in the environment and the cluster corresponds to a physical surface of the object (e.g., a hand). The hand detection model 222 may use the feature points (e.g., a combination of three-dimensional Cartesian coordinates and the corresponding depths) to identify clusters corresponding to a hand or to what is not a hand.

The gesture tracking module 220 may use the output of the gesture detection model 221 or the hand detection model 222 to generate instructions or prompts for display to the user (e.g., via the screen 103 of the mobile client 100). The instructions may include a suggested movement that the user should make or position that the user should be in within the camera view of the mobile client 100 to detect a hand gesture. For example, the gesture tracking module 220 may generate an outline of a hand or hand gesture on the screen 103 to guide a user to form a hand gesture. The camera 102 captures images of the user's hand as it aligns with the displayed guide. The captured images may be input into the gesture detection model 221 or the hand detection model 222. If a gesture or hand is not detected within the images, the gesture tracking module 220 may generate a notification at the mobile client 100 that the detection failed, instructions to guide the user to make a desired hand gesture, or a combination thereof.

One or more of the gesture detection model 221 or the hand detection model 222 may output a classification based on a success threshold. The models may determine that the likelihood of a classification (e.g., a confidence score) of a particular gesture or of the presence of a hand must meet or exceed a success threshold. For example, the gesture detection model 221 determines that a “five” gesture is being made with 60% likelihood and determines, based on a success threshold of 90%, that there is not a “five” gesture present in the input image or video. In another example, the hand detection model 222 determines that a hand is present with 95% likelihood and determines, based on a success threshold of 80%, that there is indeed a hand present in the input image or video. The success threshold for each model of the gesture tracking module 220 may be different. In some embodiments, a success threshold may be user-specified or adjustable. For example, a user may lower the success threshold used by the models of the gesture tracking module 220 to increase the likelihood that his hand gesture will be identified. In this example, the user may increase the flexibility afforded when the user's hand is not placed within the camera view at a sufficiently proper angle or position, and a model may be more likely to determine the user's gesture as one that the user is indeed making.
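The threshold comparison above may be illustrated with a minimal sketch. The `THRESHOLDS` table and the `classify` function are hypothetical names; the numeric values reproduce the 90% and 80% examples in the preceding paragraph.

```python
from typing import Dict, Optional

# Per-model success thresholds taken from the examples above.
THRESHOLDS: Dict[str, float] = {"gesture": 0.90, "hand": 0.80}


def classify(
    model_name: str,
    label: str,
    confidence: float,
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Optional[str]:
    """Return the label only if the confidence meets or exceeds the
    success threshold of the named model; otherwise return None."""
    return label if confidence >= thresholds[model_name] else None
```

Passing a different `thresholds` dictionary corresponds to a user lowering or raising the success threshold.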

In some embodiments, the gesture detection model 221 and the hand detection model 222 may be one model. For example, the gesture detection model 221 may be used to both detect the presence of a hand within an image and classify the gesture made by the present hand. In some embodiments, use of the hand detection model 222 may improve the processing bandwidth usage, and thus the power consumption, of the mobile client 100 by enabling the gesture tracking application 130 (e.g., the gesture tracking module 220) to discard image frames of videos where a hand is not present. Video processing consumes a large amount of a mobile client's processing bandwidth, and by discarding frames that do not have information relevant to the gesture tracking application 130, subsequent processing of image frames is reserved for images that contain valuable information (e.g., a hand gesture being made).

The application of both the hand detection model 222 and the gesture detection model 221 may reduce the processing cycles and power consumption of the mobile client 100. In particular, applying the hand detection model 222 to an image before applying the gesture detection model 221 may save processing cycles. In some embodiments, the gesture detection model 221 may be able to classify a large number of gestures (e.g., both still and moving gestures) and thus requires complex processing with each application to an image in order to determine which of the many gestures is present within the image. By contrast, the hand detection model 222 may be simpler, as detecting a hand by its outline may be simpler than determining the gesture being made. Accordingly, the mobile client 100 may conserve its processing resources by first applying the hand detection model 222, which requires less processing bandwidth, and subsequently applying the gesture detection model 221, which requires more processing bandwidth, only if the hand detection model 222 has detected a hand.
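The two-model ordering described above may be sketched as a simple cascade. In this sketch, `process_frames` is a hypothetical name, and the two detectors are supplied as plain callables standing in for the hand detection model 222 and the gesture detection model 221; no particular model implementation is assumed.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def process_frames(
    frames: Iterable[Any],
    detect_hand: Callable[[Any], bool],
    detect_gesture: Callable[[Any], Optional[str]],
) -> Tuple[List[str], int]:
    """Apply the cheaper hand detector first and run the costlier
    gesture detector only on frames in which a hand was found.

    Returns the detected gestures and the number of times the gesture
    detector was invoked.
    """
    gestures: List[str] = []
    gesture_calls = 0
    for frame in frames:
        if not detect_hand(frame):
            continue  # frame discarded; no further processing spent on it
        gesture_calls += 1
        gesture = detect_gesture(frame)
        if gesture is not None:
            gestures.append(gesture)
    return gestures, gesture_calls
```

The returned call count makes the saving visible: frames without a hand never reach the gesture detector.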

The AR object state module 230 may hold in memory the current state of the AR object or previous states of the AR objects to determine a subsequent state based on the user's hand gesture and the current or previous states of the AR object. A conditional decision algorithm may be used by the AR object state module 230 to determine a state to instruct the rendering module 240 to render the AR object in. The AR object state module 230 may receive custom mappings of states for AR objects and hand gestures from the user. For example, a user may specify that an AR object corresponding to a shield is to be rendered during game play when a user waves his hand and that the rendering module 240 should stop rendering the shield when a user waves his hand again. States of an AR object and the application of the AR object state module 230 in determining those states using detected hand gestures is further described in the descriptions of FIGS. 4, 5A, and 5B.
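The shield example above depends on both the detected gesture and the current state. A minimal sketch of such a conditional decision follows; the class name `ShieldStateTracker` and the state labels are hypothetical and serve only to illustrate the toggle.

```python
class ShieldStateTracker:
    """Holds the current state of a shield AR object and updates it
    from detected gestures (illustrative only)."""

    def __init__(self) -> None:
        self.shield_visible = False  # current state held in memory

    def on_gesture(self, gesture: str) -> str:
        """Toggle the shield on each wave; other gestures leave the
        state unchanged. Returns the state to be rendered."""
        if gesture == "wave":
            self.shield_visible = not self.shield_visible
        return "shield_visible" if self.shield_visible else "shield_hidden"
```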

The rendering module 240 provides for display, on the mobile client 100, an augmented reality (AR) object that may be controlled by a user's gestures. The AR engine 120 generates the AR object. When the AR object is at a location on a virtual coordinate space that represents the surfaces within the environment captured by the camera view of the camera 102, the rendering module 240 displays the AR object in a state based on a detected hand gesture. As referred to herein, a “state” of an AR object refers to a position, angle, shape, or any suitable condition of appearance that may change over time. For example, the rendering module 240 may render a ball in a first state where the ball is not aflame or in a second state where the ball is aflame. In another example, the rendering module 240 may render the ball in a first state where the ball is located at a first set of Cartesian coordinates and a first depth in a virtual coordinate space corresponding to the environment or in a second state where the ball is located at a second set of Cartesian coordinates and a second depth in the virtual coordinate space.
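One way to represent a “state” as defined above is an immutable record whose fields capture position, depth, and appearance. The `ARObjectState` record and the `ignite` and `move_to` helpers are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ARObjectState:
    """Illustrative state: Cartesian coordinates and depth in the
    virtual coordinate space, plus one appearance flag."""
    x: float
    y: float
    depth: float
    aflame: bool = False


def ignite(state: ARObjectState) -> ARObjectState:
    """Return a new state in which the object is aflame."""
    return replace(state, aflame=True)


def move_to(state: ARObjectState, x: float, y: float, depth: float) -> ARObjectState:
    """Return a new state at a different location and depth."""
    return replace(state, x=x, y=y, depth=depth)
```

Because each helper returns a new record, earlier states remain available for decisions that depend on previous states.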

Processes for Controlling AR Objects Using Gesture Detection

FIG. 3 is a flowchart illustrating a process 300 for controlling an AR object using gesture detection, in accordance with at least one example embodiment. The process 300 may be performed by the gesture tracking application 130. The gesture tracking application 130 may perform operations of the process 300 in parallel or in different orders, or may perform different, additional, or fewer steps. For example, prior to receiving 302 the image, the gesture tracking application 130 may generate for display on the mobile client an instruction to position a user's hand within a certain area of the camera view of the mobile client (e.g., in order to successfully capture the hand within the image).

The gesture tracking application 130 receives 302, from a mobile client, an image from a camera view of an environment, where the image depicts a portion of a body of a user. The gesture tracking module 220 may perform the receiving 302. For example, the mobile client 100 provides an image of a user's hand, from the camera view captured by the camera 102, within an environment of the user's living room.

The gesture tracking application 130 provides 304 the image to a machine learning model configured to identify a formation of the portion of the body. For example, the gesture tracking module 220 may apply the image data of the received 302 image to the gesture detection model 221. The gesture detection model 221 may identify a formation being made by the hand within the image and classify the formation into one of multiple potential formations (e.g., as specified by a user through labels corresponding to the potential formations that were used to train the gesture detection model 221).

The gesture tracking application 130 provides 306 for display on the mobile client, based on an identification of the formation by the machine learning model, an AR object in the camera view of the environment. The rendering module 240 may perform the providing 306. The gesture tracking application 130 may determine a state that the AR object is to be rendered in. For example, the AR object state module 230 uses a conditional decision tree that indicates that, if a particular formation is identified, then the rendering module 240 is to display a corresponding AR object and in a corresponding state.

FIG. 4 is a flowchart illustrating a process 400 for controlling the AR object using gesture detection based on the process 300 of FIG. 3, in accordance with at least one example embodiment. The process 400 includes subprocesses of the process 300. Like the process 300, the process 400 may be performed by the gesture tracking application 130. The gesture tracking application 130 may perform operations of the process 400 in parallel or in different orders, or may perform different, additional, or fewer steps. For example, after determining 406 what formation was identified, the gesture tracking application 130 may access a previous formation identified and determine, using both the previous and the current formation, a state in which to generate the AR object.

The gesture tracking application 130 receives 402, from a mobile client, an image from a camera view of an environment, the image depicting a hand of a user. The hand captured within the received 402 image is one example of a portion of the body of the user received 302 in the process 300.

The gesture tracking application 130 applies 404 a machine learning model to the image, the machine learning model trained on training image data representative of hand formations, the machine learning model configured to identify a formation of the hand in the image as one of the hand formations. The machine learning model applied 404 is one example of a machine learning model applied 304 in the process 300. The machine learning model may be a convolutional neural network trained (e.g., by the machine learning model training engine 210) using various images of hands making particular formations (e.g., fists and fives). In one example, if the image received 402 depicts the hand making a fist, the machine learning model may output a classification that the identified formation is a fist (e.g., palm covered over by fingers curled in).
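As an illustration of the classification step, the following minimal sketch turns per-class scores into a formation label. The label list, the example scores, and the helper names `softmax` and `classify_formation` are hypothetical and chosen only for illustration; in practice a trained convolutional neural network would supply the scores, and the disclosure does not prescribe this particular post-processing.

```python
import math

# Hypothetical label set; the description names fists and fives only as examples.
FORMATIONS = ["fist", "five"]


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_formation(scores, labels=FORMATIONS):
    """Return (label, probability) for the highest-scoring formation."""
    if len(scores) != len(labels):
        raise ValueError("one score per label is required")
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Example scores are made up; a real model would produce them from an image.
label, prob = classify_formation([2.0, 0.5])
```

The returned probability could also serve as a confidence value that is compared against a threshold before any AR object state is changed.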

The gesture tracking application 130 determines 406 whether a first or a second hand formation is identified by the machine learning model. The gesture tracking application may make this determination using a mapping between formations and AR object states, as described in the description of the AR object state module 230. If a first hand formation is identified, the gesture tracking application 130 provides 408, for display on the mobile client, the AR object in a first state from the AR engine in the camera view of the environment. If a second hand formation is identified, the gesture tracking application 130 provides 410, for display on the mobile client, the AR object in a second state from the AR engine in the camera view of the environment. To provide the AR object for display in a particular state, the gesture tracking application may use the rendering module 240 which receives instructions to render a particular AR object in a particular state from the AR object state module 230 and transmits instructions to the AR engine 120 to render the AR object accordingly at the screen 103 of the mobile client 100. The determination 406 and either provided 408 or 410 AR object for display may be one example of the provided 306 AR object for display of the process 300.
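The determination 406 and the subsequent providing 408 or 410 can be pictured as a lookup followed by an instruction to the renderer. The sketch below is one possible arrangement under stated assumptions: the state names, the `RenderInstruction` structure, and the `instruction_for` helper are hypothetical and are not identifiers from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping between hand formations and AR object states.
FORMATION_TO_STATE = {
    "five": "resting_on_hand",  # a first hand formation -> a first state
    "fist": "moving_upward",    # a second hand formation -> a second state
}


@dataclass(frozen=True)
class RenderInstruction:
    """Assumed shape of what a rendering module might forward to an AR engine."""
    ar_object: str
    state: str


def instruction_for(formation: str, ar_object: str = "ball") -> Optional[RenderInstruction]:
    """Look up the state for a formation; return None if the formation is unmapped."""
    state = FORMATION_TO_STATE.get(formation)
    if state is None:
        return None
    return RenderInstruction(ar_object=ar_object, state=state)
```

Returning `None` for an unmapped formation leaves the caller free to keep the AR object in its current state rather than forcing a change.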

Example AR Application with Gesture Tracking to Control AR Objects

FIGS. 5A and 5B illustrate user interactions with an AR application that integrates gesture tracking to control AR objects, in accordance with at least one embodiment. FIG. 5A shows a first user interaction 500 a where a user's hand 510 a is in a first state and is captured, by the mobile client 100, within an image (e.g., camera-captured hand 510 b). The gesture tracking application 130 renders an AR object 520 (e.g., a ball) for display on the mobile client 100 such that the AR object 520 appears integrated into the environment with the user's hand 510 a captured within the camera view.

In the first user interaction 500 a, the AR object 520 is rendered for display (e.g., by the rendering module 240) in a first state where it is at a position overlaying the user's hand. The gesture tracking application 130 may determine a position within a virtual coordinate space that the user's hand 510 a is located and determine a corresponding location to render the AR object 520 (e.g., a set of coordinates causing the AR object 520 to appear above the user's hand 510 a).
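A minimal sketch of deriving a render location from the hand location follows. It assumes a y-up coordinate space and a fixed vertical offset expressed in the same units as the hand coordinates; the `Point3D` type, the `position_above_hand` helper, and the default offset are hypothetical.

```python
from typing import NamedTuple


class Point3D(NamedTuple):
    x: float
    y: float
    z: float


def position_above_hand(hand: Point3D, vertical_offset: float = 0.1) -> Point3D:
    """Return coordinates directly above the hand.

    Assumes the y axis points up; this is an illustrative convention only.
    """
    if vertical_offset < 0:
        raise ValueError("vertical_offset must be non-negative")
    return Point3D(hand.x, hand.y + vertical_offset, hand.z)
```

For example, `position_above_hand(Point3D(0.2, 1.0, -0.5), 0.25)` keeps the x and z coordinates of the hand and raises only the y coordinate.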

The gesture tracking application 130 provides a gesture indicator 530 for display that indicates the gesture identified by the gesture tracking application (e.g., by the gesture detection model 221). In the first user interaction 500 a, the gesture indicator 530 indicates that the user's hand 510 a is making a five. The gesture tracking application 130 may highlight, circle, or otherwise visually distinguish a gesture indicator from other indicators to inform the user of presently detected gestures within the camera view.

FIG. 5B shows a second user interaction 500 b where the user's hand 510 a is in a second state and is captured, by the mobile client 100, within an image (e.g., camera-captured hand 510 b). The gesture tracking application 130 provides a gesture indicator 540 for display that indicates the gesture identified by the gesture detection model 221 is a fist during the second user interaction 500 b. The gesture tracking application 130 renders the AR object 520 for display in a second state where the object appears to move upward within the camera view, as indicated by the state change indicator 550 included in FIG. 5B for clarity and not necessarily rendered by the gesture tracking application 130 for display to the user. In some embodiments, the gesture tracking application 130 may use a previous state and the present gesture of a user's hand to determine a subsequent state in which to render an AR object (e.g., using a mapping table or conditional decision tree). For example, the gesture tracking application 130 determines that a combination of the identified fist and the existing position of the AR object 520 over the user's hand indicates that the next state in which the AR object is to be rendered is appearing to move upward.
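The use of a previous state together with the present gesture can be sketched as a small mapping table keyed by both. The state names, gesture labels, table entries, and the helpers `next_state` and `run` below are hypothetical; the fallback of keeping the current state when no rule applies is likewise an assumption rather than a requirement of the disclosure.

```python
# Hypothetical transition table keyed by (previous state, detected gesture).
TRANSITIONS = {
    ("over_hand", "fist"): "moving_upward",
    ("over_hand", "five"): "over_hand",
    ("moving_upward", "five"): "over_hand",
}


def next_state(previous_state: str, gesture: str) -> str:
    """Return the next AR object state, keeping the current one if no rule applies."""
    return TRANSITIONS.get((previous_state, gesture), previous_state)


def run(gestures, initial_state="over_hand"):
    """Apply a sequence of detected gestures and collect the visited states."""
    states = [initial_state]
    for gesture in gestures:
        states.append(next_state(states[-1], gesture))
    return states
```

A table of this kind keeps the gesture-to-state logic in data, so adding a user-customized gesture would only require adding entries.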

Computing Machine Architecture

FIG. 6 is a block diagram illustrating components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller). Specifically, FIG. 6 shows a diagrammatic representation of a machine in the example form of a computer system 600 within which program code (e.g., software) for causing the machine to perform any one or more of the methodologies discussed herein may be executed. The program code may correspond to functional configuration of the modules and/or processes described with FIGS. 1-5B. The program code may be comprised of instructions 624 executable by one or more processors 602. In alternative embodiments, the machine operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine may be a portable computing device or machine (e.g., smartphone, tablet, wearable device (e.g., smartwatch)) capable of executing instructions 624 (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute instructions 624 to perform any one or more of the methodologies discussed herein.

The example computer system 600 includes a processor 602 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these), a main memory 604, and a static memory 606, which are configured to communicate with each other via a bus 608. The computer system 600 may further include a visual display interface 610. The visual interface may include a software driver that enables displaying user interfaces on a screen (or display). The visual interface may display user interfaces directly (e.g., on the screen) or indirectly on a surface, window, or the like (e.g., via a visual projection unit). For ease of discussion the visual interface may be described as a screen. The visual interface 610 may include or may interface with a touch enabled screen. The computer system 600 may also include an alphanumeric input device 612 (e.g., a keyboard or touch screen keyboard), a cursor control device 614 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 616, a signal generation device 618 (e.g., a speaker), and a network interface device 620, which also are configured to communicate via the bus 608.

The storage unit 616 includes a machine-readable medium 622 on which are stored instructions 624 (e.g., software) embodying any one or more of the methodologies or functions described herein. The instructions 624 (e.g., software) may also reside, completely or at least partially, within the main memory 604 or within the processor 602 (e.g., within a processor's cache memory) during execution thereof by the computer system 600, the main memory 604 and the processor 602 also constituting machine-readable media. The instructions 624 (e.g., software) may be transmitted or received over a network 626 via the network interface device 620.

While machine-readable medium 622 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., instructions 624). The term “machine-readable medium” shall also be taken to include any medium that is capable of storing instructions (e.g., instructions 624) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein. The term “machine-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.

Additional Configuration Considerations

While using an AR application, a user may desire a more seamless experience by controlling AR objects with his hands (e.g., using hand gestures), as the user is accustomed to doing in reality. Conventional AR systems satisfy this desire by tracking hand gestures with dedicated handheld controllers, a wired power source, and enough hardware real estate to accommodate powerful, power-hungry processors. Mobile clients do not share those specifications and cannot afford gesture tracking in that conventional manner. Rather, a mobile device is limited in its power and processing resources. The embodiments herein optimize for a mobile device's power and processing constraints by limiting the image frames processed during gesture detection (e.g., discarding image frames that a machine learning model determines do not depict a hand). Thus, the methods described herein enable gesture tracking for AR object control on mobile client rendered AR systems without consuming excessive amounts of processing power, and present an immersive AR experience to the user.
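The frame-limiting idea can be sketched as a filter placed ahead of gesture classification. In the sketch below, the `frames_with_hands` helper and the `depicts_hand` predicate are hypothetical, and the sample frames are stand-in dictionaries carrying a precomputed flag rather than pixel data; in practice the predicate would be backed by a machine learning model.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def frames_with_hands(
    frames: Iterable[Frame], depicts_hand: Callable[[Frame], bool]
) -> Iterator[Frame]:
    """Yield only frames in which a hand is detected; other frames are dropped."""
    for frame in frames:
        if depicts_hand(frame):
            yield frame
        # Frames without a hand are not retained, so no gesture
        # classification work is spent on them.


# Stand-in frames: a boolean flag replaces a real hand detector's output.
sample = [{"id": 1, "hand": True}, {"id": 2, "hand": False}, {"id": 3, "hand": True}]
kept = [f["id"] for f in frames_with_hands(sample, lambda f: f["hand"])]
```

Writing the filter as a generator means frames are examined one at a time, so a rejected frame does not need to be held in memory while later frames are processed.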

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., a standalone, client or server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.

In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a general-purpose processor configured using software, the general-purpose processor may be configured as respective different hardware modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple of such hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connect the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).

The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented modules.

Similarly, the methods described herein may be at least partially processor-implemented. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented hardware modules. The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the processor or processors may be located in a single location (e.g., within a home environment, an office environment or as a server farm), while in other embodiments the processors may be distributed across a number of locations.

The one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors), these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., application program interfaces (APIs)).

The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.

Some portions of this specification are presented in terms of algorithms or symbolic representations of operations on data stored as bits or binary digital signals within a machine memory (e.g., a computer memory). These algorithms or symbolic representations are examples of techniques used by those of ordinary skill in the data processing arts to convey the substance of their work to others skilled in the art. As used herein, an “algorithm” is a self-consistent sequence of operations or similar processing leading to a desired result. In this context, algorithms and operations involve physical manipulation of physical quantities. Typically, but not necessarily, such quantities may take the form of electrical, magnetic, or optical signals capable of being stored, accessed, transferred, combined, compared, or otherwise manipulated by a machine. It is convenient at times, principally for reasons of common usage, to refer to such signals using words such as “data,” “content,” “bits,” “values,” “elements,” “symbols,” “characters,” “terms,” “numbers,” “numerals,” or the like. These words, however, are merely convenient labels and are to be associated with appropriate physical quantities.

Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or a combination thereof), registers, or other machine components that receive, store, transmit, or display information.

As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

Some embodiments may be described using the expression “coupled” and “connected” along with their derivatives. It should be understood that these terms are not intended as synonyms for each other. For example, some embodiments may be described using the term “connected” to indicate that two or more elements are in direct physical or electrical contact with each other. In another example, some embodiments may be described using the term “coupled” to indicate that two or more elements are in direct physical or electrical contact. The term “coupled,” however, may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. The embodiments are not limited in this context.

As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).

In addition, use of “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.

Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for gesture tracking in an augmented reality environment executed on a mobile client through the disclosed principles herein. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the disclosed embodiments are not limited to the precise construction and components disclosed herein. Various modifications, changes and variations, which will be apparent to those skilled in the art, may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope defined in the appended claims. 

What is claimed is:
 1. A non-transitory computer readable storage medium comprising stored instructions, the instructions when executed by a processor cause the processor to: receive from a mobile client an image from a camera view of an environment, the image depicting a hand of a user; apply a machine learning model to the image, the machine learning model trained on training image data representative of a plurality of hand formations, the machine learning model configured to identify a formation of the hand in the image as one of the plurality of hand formations; provide for display on the mobile client, responsive to identification of the formation of the hand as a first hand formation of the plurality of hand formations, an augmented reality (AR) object in a first state from an AR engine in the camera view of the environment; and provide for display on the mobile client, responsive to the formation of the hand identified as a second hand formation of the plurality of hand formations, the AR object in a second state from the AR engine in the camera view of the environment.
 2. The non-transitory computer readable storage medium of claim 1, wherein the machine learning model is a first machine learning model, and wherein the instructions further comprise instructions that when executed by the processor cause the processor to identify a location of the hand, the instructions to identify location of the hand further comprising instructions that when executed by the processor cause the processor to: apply a second machine learning model to the received image, the second machine learning model configured to classify real-world objects in the environment, the real-world objects including the hand; receive a plurality of feature points associated with the real-world objects in the environment; generate a three-dimensional (3D) virtual coordinate space based on the plurality of feature points; and identify, based on a classification of the hand by the second machine learning model, the location of the hand associated with corresponding coordinates in the generated 3D virtual coordinate space.
 3. The non-transitory computer readable storage medium of claim 2, wherein the AR engine rendered object is provided for display based on the identified location.
 4. The non-transitory computer readable storage medium of claim 2, wherein the instructions further comprise instructions that when executed by the processor cause the processor to remove data of a given image from memory responsive to the second machine learning model classifying an absence of any hand depicted within the given image.
 5. The non-transitory computer readable storage medium of claim 1, wherein the instructions further comprise instructions that when executed by the processor cause the processor to train the machine learning model using the training image data, the instruction to train the machine learning model further comprising instructions that when executed by the processor cause the processor to: receive a plurality of images of the plurality of hand formations; and apply a respective label to each of the plurality of images of the plurality of hand formations, the training image data comprising the labeled plurality of images.
 6. The non-transitory computer readable storage medium of claim 5, wherein each respective label corresponds to a computer executable command.
 7. The non-transitory computer readable storage medium of claim 5, wherein the plurality of hand formations includes a user-customized hand formation.
 8. The non-transitory computer readable storage medium of claim 7, wherein the instructions further comprise instructions that when executed by the processor cause the processor to prompt the user to provide a user-specified state of the AR engine rendered object, wherein an identification of the user-customized hand formation by the machine learning model indicates that the AR engine rendered object is to be provided for display in the user-specified state.
 9. The non-transitory computer readable storage medium of claim 1, wherein the machine learning model is further configured to output a confidence score associated with the identified formation of the hand, and wherein providing for display the AR engine rendered object in the first state or in the second state is further responsive to the confidence score exceeding a threshold confidence score.
 10. The non-transitory computer readable storage medium of claim 9, wherein the instructions further comprise instructions that when executed by the processor cause the processor to receive a selection of a user-specified threshold confidence score.
 11. The non-transitory computer readable storage medium of claim 1, wherein the AR engine is a game engine.
 12. A computer system comprising: a gesture tracking module configured to: receive from a mobile client an image from a camera view of an environment, the image depicting a hand of a user; and apply a machine learning model to the image, the machine learning model trained on training image data representative of a plurality of hand formations, the machine learning model configured to identify a formation of the hand in the image as one of the plurality of hand formations; and a rendering module of the gesture tracking application configured to: provide for display on the mobile client, responsive to identification of the formation of the hand as a first hand formation of the plurality of hand formations, an augmented reality (AR) object in a first state from an AR engine in the camera view of the environment; and provide for display on the mobile client, responsive to the formation of the hand identified as a second hand formation of the plurality of hand formations, the AR object in a second state from the AR engine in the camera view of the environment.
 13. The system of claim 12, wherein the machine learning model is a first machine learning model, wherein the gesture tracking module is further configured to identify a location of the hand within the image by being further configured to: apply a second machine learning model to the received image, the second machine learning model configured to classify real-world objects in the environment, the real-world objects including the hand; receive a plurality of feature points associated with the real-world objects in the environment; generate a three-dimensional (3D) virtual coordinate space based on the plurality of feature points; and identify, based on a classification of the hand by the second machine learning model, the location of the hand associated with corresponding coordinates in the generated 3D virtual coordinate space.
 14. The system of claim 13, wherein the AR engine rendered object is provided for display based on the identified location.
 15. The system of claim 13, wherein the gesture tracking module is further configured to remove data of a given image from memory responsive to classification of an absence of any hand depicted within the given image by the second machine learning model.
 16. A computer-implemented method comprising: receiving from a mobile client an image from a camera view of an environment, the image depicting a hand of a user; applying a machine learning model to the image, the machine learning model trained on training image data representative of a plurality of hand formations, the machine learning model configured to identify a formation of the hand in the image as one of the plurality of hand formations; providing for display on the mobile client, responsive to identification of the formation of the hand as a first hand formation of the plurality of hand formations, an augmented reality (AR) object in a first state from an AR engine in the camera view of the environment; and providing for display on the mobile client, responsive to the formation of the hand identified as a second hand formation of the plurality of hand formations, the AR object in a second state from the AR engine in the camera view of the environment.
 17. The computer-implemented method of claim 16, wherein the machine learning model is a first machine learning model, further comprising identifying a location of the hand within the image by: applying a second machine learning model to the received image, the second machine learning model configured to classify real-world objects in the environment, the real-world objects including the hand; receiving a plurality of feature points associated with the real-world objects in the environment; generating a three-dimensional (3D) virtual coordinate space based on the plurality of feature points; and identifying, based on a classification of the hand by the second machine learning model, the location of the hand associated with corresponding coordinates in the generated 3D virtual coordinate space.
 18. The computer-implemented method of claim 16, wherein the AR engine rendered object is provided for display based on the identified location.
 19. The computer-implemented method of claim 16, further comprising removing data of a given image from memory responsive to the second machine learning model classifying an absence of any hand depicted within the given image.
 20. A computer-implemented method comprising: receiving from a mobile client an image from a camera view of an environment, the image depicting a portion of a body of a user; providing the image to a machine learning model configured to identify a formation of the portion of the body in the image; and providing for display on the mobile client, based on an identification of the formation by the machine learning model, an augmented reality (AR) object in the camera view of the environment. 